MIDI encoding and decoding

This invention relates to a method of composing and a method of
decomposing a multimedia signal according to the Musical Instrument Digital
Interface (MIDI) specification. According to the MIDI specification, the
multimedia signal carries a description of a musical composition by means of
events of a first type that are arranged to carry instructions to a unit on which
patches to use for playback, which notes to play, and at which sound
levels to play each of the notes. Optionally, the MIDI specification allows use
of events of a second type, which are arranged to carry additional content.

Additionally, the invention relates to a unit for composing a multimedia signal,
a unit for decomposing a multimedia signal, and a multimedia signal.

The Musical Instrument Digital Interface (MIDI) protocol provides a
standardized and efficient means of conveying musical performance
information as electronic data. MIDI information is transmitted in 'MIDI
messages', which can be thought of as instructions that tell a music
synthesizer how to play a piece of music. The synthesizer receiving the MIDI
data must generate the actual sounds. The sounds are generated from
predefined sounds, e.g. sampled and stored in wave tables. A wave table
defines musical instruments and contains audio samples of the musical
instruments. In connection herewith, an instrument map is a collection of
instrument names, where each instrument name is associated with a
number, 0-127, also known as a program number. Thus, the instrument map
itself does not contain information about how an instrument sounds.
Additionally, the instrument map can specify fewer than 128 instruments.
Moreover, a so-called patch is an alternative name for a program and means
a specific instrument (referred to via a number, 0-127) or a specific drum kit.
The General MIDI specification defines a standard set of instruments
comprising 128 instruments, e.g. a piano, a flute, a trumpet, different drums
etc. The MIDI Detailed Specification published by the MIDI Manufacturers
Association, Los Angeles, CA, provides a complete description of the MIDI
protocol.

Originally developed to allow musicians to connect synthesizers together,
the MIDI protocol is now finding widespread use as a
delivery medium to replace or supplement digitized audio in games and
multimedia applications. There are several advantages to generating sound
with a MIDI synthesizer rather than using sampled audio from disk or CD-ROM.
The first advantage is storage space. Data files used to store digitally
sampled audio in Pulse Code Modulation (PCM) format (such as .WAV files)
tend to be quite large. This is especially true for lengthy musical pieces
captured in stereo using high sample rates.

MIDI data files, on the other hand, are extremely small when compared with
sampled audio files. For instance, files containing high quality stereo sampled
audio require about 10 Mbytes of data per minute of sound, whereas a typical
MIDI sequence might consume less than 10 Kbytes of data per minute of
sound. This is because the MIDI file does not contain the sampled audio
data; it contains only the instructions needed by a synthesizer to play the
sounds. These instructions are in the form of MIDI messages that instruct the
synthesizer e.g. which patches to use, which notes to play, and how loud to
play each note. The actual sounds are generated by the synthesizer.
Other advantages of using MIDI to generate sounds include the ability to
easily edit the music, and the ability to change the playback speed and the
pitch or key of the sounds independently.
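
The figure of about 10 Mbytes per minute can be checked with a short calculation. The sketch below assumes CD-quality parameters (44.1 kHz sample rate, 16-bit samples, two channels); these parameters are an assumption for illustration and are not stated in the text above.

```python
# Assumed CD-quality PCM parameters (not stated in the text above).
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 2           # stereo


def pcm_bytes_per_minute(rate=SAMPLE_RATE_HZ, width=BYTES_PER_SAMPLE,
                         channels=CHANNELS):
    """Uncompressed PCM storage needed for one minute of audio."""
    return rate * width * channels * 60


pcm = pcm_bytes_per_minute()
midi_upper_bound = 10 * 1024  # "less than 10 Kbytes" per minute, as quoted above

print(f"PCM:  {pcm} bytes per minute")
print(f"Ratio to a 10 Kbyte MIDI sequence: about {pcm // midi_upper_bound}x")
```

Under these assumptions one minute of sampled audio occupies roughly three orders of magnitude more storage than the corresponding MIDI sequence.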

The recipient of this MIDI data stream is commonly a MIDI sound generator 
or sound module, which will receive MIDI messages at its MIDI IN connector, 
and respond to these messages by playing sounds. 

MIDI files contain one or more MIDI streams, with time information for each
event. The event can be a regular MIDI command or an optional META event,
which can carry information such as lyrics and tempo. 'Lyrics' and 'Tempo' are
examples of such META events. Lyric, sequence, and track structures,
tempo and time signature information are all well supported. In addition, track
names and other descriptive information may be stored with the MIDI data as
META events.

MIDI files are made up of chunks. A MIDI file always starts with a header
chunk followed by one or more track chunks. Basically, a chunk comprises a
value indicating the size of the chunk and a series of messages.
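
The chunk layout can be illustrated with a minimal sketch that walks the chunks of a MIDI file held in memory. It assumes the Standard MIDI File layout, in which each chunk begins with a four-character type identifier ('MThd' or 'MTrk') followed by a 32-bit big-endian length; the text above mentions only the size value and the messages. The function name `iter_chunks` is hypothetical.

```python
import struct


def iter_chunks(data: bytes):
    """Yield (chunk_type, payload) for each chunk in a MIDI file image."""
    pos = 0
    while pos + 8 <= len(data):
        chunk_type = data[pos:pos + 4]
        (length,) = struct.unpack(">I", data[pos + 4:pos + 8])
        payload = data[pos + 8:pos + 8 + length]
        if len(payload) < length:
            raise ValueError("truncated chunk")
        yield chunk_type, payload
        pos += 8 + length


# Tiny hand-made file: a header chunk (format 0, one track, 96 ticks per
# quarter note) followed by a track chunk holding only an end-of-track event.
header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, 96)
track_body = bytes([0x00, 0xFF, 0x2F, 0x00])
track = b"MTrk" + struct.pack(">I", len(track_body)) + track_body

chunks = list(iter_chunks(header + track))
for chunk_type, payload in chunks:
    print(chunk_type.decode("ascii"), len(payload))
```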

This structure of the MIDI protocol allows for a very efficient representation of
the instrumental portion of a musical composition due to the utilization of
predefined sounds for notes of instruments used in the composition.

However, vocal song or vocals often form an appreciable portion of a musical
composition. The MIDI protocol is insufficient for handling a
vocal song or vocals, or a vocal portion of a musical
composition. An explanation for this insufficiency is that vocal song or vocals
cannot be represented by playing tones from a relevant MIDI map.

From a memory consumption point of view, a musical composition can be
sampled, typically by use of Pulse Code Modulation, compressed by coding
for efficient storage, and decoded for the purpose of reproduction or
playback. Typical encoding/decoding schemes comprise MP3, which is
MPEG-1 layer 3 (MPEG = Moving Picture Experts Group); AMR (Adaptive
Multi Rate); and AAC (Advanced Audio Coding). However, whether in
compressed or uncompressed form, a sampled musical composition will not
provide access to the protocol according to which the composition is stored
for manipulation of individual notes of the musical composition and how they
are played, since this information is lost during sampling.

Thus, there exists no efficient way for the combined storage of the vocal song
or vocals portions and instrumental portions of a musical composition.

This problem is solved when the method mentioned in the opening paragraph
comprises the steps of generating the multimedia signal by inserting events
of the second type and by applying additional content to events of the second
type, wherein the additional content comprises addresses of encoded
samples of sampled multimedia content.

Consequently, e.g. a MIDI representation of a musical composition can also
provide efficient means of conveying vocal song or vocals or other audio
performance. Since, according to the invention, information of vocal song or
vocals is conveyed by means of events which typically are dedicated to other
purposes than determination of which instrument patches to use, which
instrument notes to play, and which sound level to play an instrument note at,
the representation of the musical instrument performance will not be
corrupted. The additional content of the events conveying the vocal song or
vocals performance comprises an address to the encoded samples of the
sampled multimedia content, which may comprise the vocal song or vocals
performance. Thereby, the encoded samples may be located either inside
(inline) or outside (external) a signal carrying the MIDI presentation; this signal
may be denoted a multimedia signal. Preferably, the encoded samples are
outside the multimedia signal. Thereby, the multimedia signal, which is a MIDI
signal, is not loaded with the encoded samples. Even though the encoded
samples are compressed, it may be convenient to handle them at a location
external to the MIDI signal. Apparatuses which read the MIDI signal but
do not support reproduction of the vocal song or vocals performance are
thereby not loaded with the coded samples.

In a preferred embodiment, the method additionally comprises the step of
inserting samples of the first type. This allows for composing the multimedia
signal from sources of MIDI and vocal song or vocals/audio/video that
supply content in simultaneous streams. Alternatively, the multimedia signal
can be composed of MIDI and vocal song or vocals/audio/video content
stored in Random Access Memory types.

Preferably, the method comprises the step of inserting a delta-time value
before each of the events of the second type, wherein the delta-time value
represents a point in time at which to begin playback of the sampled
multimedia content. This use of delta-time values allows for specifying
precisely at which delta-time instant a given portion or part of the encoded
vocal performance is to be played. Thereby synchronization means is
provided to synchronize the musical and the vocal parts of a composition.
When the multimedia signal is being composed, a delta-time counter can be
utilized to obtain a time-stamp for use in inserting a delta-time value before
an event of the second type, which carries a reference to the vocal
performance. Thereby, the composition of the musical part and the vocal part
of the multimedia signal can utilize a common delta-time counter.
Alternatively, the vocal part can be composed with delta-time values made
relative to delta-time values in an existing file or stream of events of the first
type, which carries the musical part.

As mentioned in the introduction, the invention also relates to a method of
decomposing a multimedia signal, wherein the method comprises the steps of
parsing the signal to identify events of the second type and to read the
additional content; loading coded samples of multimedia content at an
address specified in the additional content; and decoding the coded samples
to provide decoded samples for playback of the multimedia content.

In preferred embodiments, the events of the second type comprise System
Exclusives events as defined in the specification of the Musical Instrument
Digital Interface (MIDI). System Exclusives events, also referred to as
sysex events, are defined to be associated with a manufacturer's own,
centrally issued and registered identification number. A normal, complete
System Exclusive event is stored as four components: a first being an
identifier with the hexadecimal value 'F0', a second being a hexadecimal
value of the number of bytes to be transmitted after 'F0', a third being the
additional content, and a fourth being a terminator with the hexadecimal
value 'F7'. According to the invention, the additional content comprises the
address at which to retrieve the coded audio data.

When the events of the second type comprise Meta-events as defined in the
specification of the Musical Instrument Digital Interface (MIDI), additional
possibilities of representing a musical composition are provided.

In preferred embodiments, the events of the second type comprise
Meta-events of the type cue-point, identified by the HEX value FF 07. A cue-point
event comprises three components: a first being an identifier with the
hexadecimal value 'FF 07', a second being a hexadecimal value of the
number of bytes to be transmitted after 'FF 07', and a third being the
additional content.

Preferably, the events of the second type comprise Meta-events of the type
lyric, identified by the hexadecimal value FF 05.

Preferably, the events of the second type comprise Meta-events of the type
text, identified by the hexadecimal value FF 01.

When an address indicates a position in a first file associated with the
multimedia signal, an increased flexibility of distributing the multimedia signal
is obtained in that the file can contain multiple chunks of coded samples that
can be addressed individually. Additionally, a specific chunk of coded
samples can be addressed more than one time in a signal. This can result in
reuse of the coded samples and thus further compression of the multimedia
content. The address can indicate byte counts or positions in the file, or
frame or chunk numbers in the file. Additionally or alternatively, the
address can comprise a Uniform Resource Locator (URL), which can point to
locally or remotely stored files.

According to a preferred embodiment, the multimedia signal is stored in a
second file. This second file can be a standard MIDI file. Preferably, the first
file and the second file are embedded in a common file container which
allows for efficient transfer of the files.

The additional content may comprise an indication of the type of the coding
scheme used for encoding the encoded samples. Thereby, it is possible to
select one of multiple encoding/decoding schemes, e.g. as a consequence of
new and improved schemes being developed or in order to be able to select
the scheme determined to be the most efficient among the available
schemes.

The invention will be explained in more detail with reference to the drawing, in
which:

fig. 1 illustrates a unit for composing a multimedia signal;
fig. 2 illustrates a unit for decomposing a multimedia signal;
fig. 3 illustrates a file container;
fig. 4a illustrates the structure of an event-based multimedia signal combined
with coded audio data and event-based references to the coded audio data;
fig. 4b illustrates the structure of an event-based multimedia signal;
fig. 4c illustrates the structure of coded audio data and event-based
references to the coded audio data;
fig. 5 shows a flowchart of a method of composing a multimedia signal;
fig. 6 shows a flowchart of a method of decomposing a multimedia signal;
fig. 7a illustrates a schematic envelope of a multimedia signal; and
fig. 7b illustrates temporal aspects of MIDI events, coded audio events and
samples of a playback signal.

Fig. 1 illustrates a unit for composing a multimedia signal. The unit 100
comprises two main signal paths: a first, via which MIDI messages from the
OUT port of a MIDI generating device, e.g. a keyboard or another instrument,
are provided; and a second, via which sampled audio is received, encoded,
and stored, and wherein instructions to an audio or speech decoder are
inserted.

The first signal path of the unit 100 comprises a MIDI IN port 104 via which
signals or files in accordance with the MIDI specification can be received.
These signals are passed on to a merger 105 where the signals received on
the port 104 are merged with signals provided via the second signal path.
The signals provided via the first path comprise MIDI messages including
MIDI events and optionally MIDI headers and other well-known MIDI
information. It should be noted that another term for merger may be adder.

The second signal path of the unit 100 comprises a sampler 101 for sampling
audio signals and/or video signals to provide sampled audio or video signals.
Thus, these samples can represent a multimedia content which may
comprise audio and/or video. Typically, audio signals are in the frequency
range 20 Hz-20 kHz; audio signals conveying vocal song or vocals
performance only are in the frequency range of about 100 Hz-5 kHz. In an
alternative embodiment, the sampler 101 is replaced by an input port
arranged to receive sampled audio and/or sampled video, e.g. Pulse Code
Modulated samples.

The sampled audio/video signals are sent to an encoder 102, by means of
which the sampled audio/video signals are encoded to a compressed format.
Thus, a first output from the encoder is a compressed format file or signal or,
more generally, data. This file or signal is stored in a sample bank 106,
wherefrom the compressed format file or signal can be retrieved for
subsequent decoding. A second output from the encoder comprises an
address of the compressed format file or signal. The first output can be
generated by means of well-known encoding schemes such as, for audio:
MP3, which is MPEG-1 layer 3 (MPEG = Moving Picture Experts Group);
AMR (Adaptive Multi Rate); and AAC (Advanced Audio Coding), and for video:
MPEG-4 video coding, which is a so-called block-based predictive differential
video coding scheme. The address in the second output is generated by
registering where the compressed format file or signal is stored. The address
can for instance specify that the compressed format data is stored in the
address range 0000 (HEX) to 00B7 (HEX).

Based on the stored compressed format file and the address thereof, an
event inserter 103 is arranged to generate an event in accordance with the
MIDI specification. The event can be of the System Exclusives (Sysex) type
or Meta type as defined in the MIDI specification. The address is inserted
after an indication of the type of event and after an indication of the number
of bytes to follow.

According to the MIDI specification, the syntax for a system exclusives event
is the following: F0 <length> <bytes to be transmitted after F0>. Here, F0 is
an identifier identifying the type of the event as being a Sysex event. The
identifier is followed by a field <length> with a value indicating the length in
bytes of the following bytes of the event. The field <bytes to be transmitted
after F0> is also denoted additional content in the context of the present
invention. In this latter field, information for addressing the compressed format
data and any other information is placed.

A very simplistic example of the use of System Exclusives could appear like
the following fragment of an event according to the MIDI specification and in
accordance with an aspect of the invention:

64

F0 09 7D 7F XX XX 00 00 00 B7

In the first line, 64 HEX indicates that the following event is to be executed at
a delta-time of 100 ticks with a specified ticks-duration. In the second line, F0
indicates the start of a system exclusives event. 09 HEX indicates the number of
bytes succeeding the F0 code. At the following position, code 7D indicates
that the event is for research use, and hence not occupied by a specific
manufacturer of MIDI equipment. Thereby, the code 7D can be used in
accordance with the present invention. At the following position, 7F indicates
that all devices are used; however, a specific device can be used by writing
the respective device ID of the device to use. At the next two positions,
indicated by XX XX, it is possible to state sub IDs for the device stated at the
preceding position. Subsequently, 00 00 indicates a start frame and 00
B7 indicates a stop frame.
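
The byte layout described above can be sketched as a small builder function. This is an illustrative sketch only: the function name `build_sysex_reference` is hypothetical, the sub IDs shown as XX XX are filled with the placeholder value 00 00, and the F7 terminator described earlier is appended. Counting that terminator is one reading under which the length byte equals 09 for this layout; that reading is an assumption.

```python
def build_sysex_reference(start_frame: int, stop_frame: int,
                          device_id: int = 0x7F,
                          sub_ids: bytes = b"\x00\x00") -> bytes:
    """Build F0 <length> 7D <device> <sub IDs> <start> <stop> F7."""
    body = bytes([0x7D, device_id]) + sub_ids          # 7D: research use
    body += start_frame.to_bytes(2, "big")             # start frame
    body += stop_frame.to_bytes(2, "big")              # stop frame
    body += b"\xF7"                                    # terminator (assumed counted)
    if len(body) > 0x7F:
        raise ValueError("sketch handles single-byte lengths only")
    return bytes([0xF0, len(body)]) + body


event = build_sysex_reference(0x0000, 0x00B7)
print(event.hex(" ").upper())
```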

For META events of the cue-point type the syntax is the following: FF 07
<length> <text>. Here, FF 07 is an identifier identifying the type of the event.
The identifier is followed by a field <length> with a value indicating the length
in bytes of the following bytes of the event. The field <text> is also denoted
additional content in the context of the present invention. In this latter field,
information for addressing the compressed format data and any other
information is placed.

Thus, a corresponding example for a META event of the cue-point type could
appear like the following fragment:

64

FF 07 05 00 00 00 B7

Again, in the first line, 64 HEX indicates that the following event is to be
executed at a delta-time of 100 ticks. In the second line, FF 07 indicates the start
of a META event of the cue-point type. 05 indicates the length of the event
with the additional content 00 00 00 B7, wherein 00 00 indicates a start
frame and 00 B7 indicates a stop frame. In ASCII (decimal) representation the above
line starting with HEX FF is:

255 7 5 0 0 0 183

This representation may be preferred instead of the HEX representation.
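
A cue-point event carrying a start frame and a stop frame can be built and read back as sketched below. The function names are hypothetical, and the sketch assumes that each frame number occupies two bytes in big-endian order. The length byte is computed from the payload, so it comes out as 04 for a four-byte payload; the sketch does not attempt to reproduce the value 05 shown in the fragment above.

```python
def build_cue_point(start_frame: int, stop_frame: int) -> bytes:
    """Build FF 07 <length> <start frame> <stop frame>."""
    payload = start_frame.to_bytes(2, "big") + stop_frame.to_bytes(2, "big")
    return bytes([0xFF, 0x07, len(payload)]) + payload


def parse_cue_point(event: bytes):
    """Return (start_frame, stop_frame) from a cue-point meta-event."""
    if event[:2] != b"\xFF\x07":
        raise ValueError("not a cue-point meta-event")
    length = event[2]
    payload = event[3:3 + length]
    if len(payload) != 4:
        raise ValueError("expected two 2-byte frame numbers")
    return (int.from_bytes(payload[:2], "big"),
            int.from_bytes(payload[2:], "big"))


event = build_cue_point(0x0000, 0x00B7)
print(" ".join(str(b) for b in event))   # decimal form of the event bytes
print(parse_cue_point(event))
```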

Turning back to the unit 100, the function of the adder 105 is to merge the
signals comprising events provided from the first and the second signal path.
This is carried out by merging the signals such that the output from the adder
comprises events each preceded by delta-time stamps, which occur in either
ascending or descending order.
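
The merging performed by the adder can be sketched as follows. The sketch assumes ascending order and delta-times that are relative to the preceding event in the same stream; events are modelled as hypothetical (delta, payload) pairs rather than raw MIDI bytes.

```python
import heapq
from itertools import accumulate


def to_absolute(events):
    """[(delta, payload), ...] -> [(absolute_tick, payload), ...]"""
    ticks = accumulate(delta for delta, _ in events)
    return [(tick, payload) for tick, (_, payload) in zip(ticks, events)]


def merge_streams(first, second):
    """Merge two delta-timed streams into one delta-timed stream."""
    merged = heapq.merge(to_absolute(first), to_absolute(second),
                         key=lambda item: item[0])
    out, previous = [], 0
    for tick, payload in merged:
        out.append((tick - previous, payload))   # re-derive the delta-time
        previous = tick
    return out


midi_path = [(0, "note_on"), (100, "note_off")]      # first signal path
audio_path = [(50, "coded_audio_reference")]         # second signal path
print(merge_streams(midi_path, audio_path))
```

Converting to absolute ticks before merging keeps the two paths on a common time base, which corresponds to the common delta-time counter mentioned earlier.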

Fig. 2 illustrates a unit for decomposing a multimedia signal. The unit 200
comprises a parser 201 that is arranged to split a signal in accordance with
the MIDI specification into two signals. In a first embodiment the parser is
based on identifying events of a second type which are identifiable separately
from events of a first type. The events of the second type can be events
identified by a given value or bit-pattern. Thus events of the first type can be
identified as events not being of the second type. Any delta-time stamps
preceding an event are split to follow the succeeding event. Events of the first
type are then output on a port A and events of the second type are output on
port B.

In a second, alternative, embodiment, the parser is arranged to pass on all
events to port A, while making a copy of events, and their preceding
delta-time, which are determined to be of the second type.

In a third, alternative, embodiment, the parser is arranged to remove a
portion of the additional content that fulfils a given criterion before sending
the otherwise intact signal to port A. The portion of the additional content that
fulfils a given criterion is forwarded to port B with the events of the second
type that comprise the identified additional content and any preceding
delta-time value.
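
The first embodiment of the parser can be sketched as a function that routes events to port A or port B. The handling of delta-times is an assumption made for this sketch: each port accumulates the time that has elapsed since the last event it received, so that timing is preserved on both outputs. The names `split_events` and `is_cue_point` are hypothetical.

```python
def split_events(events, is_second_type):
    """Split [(delta, event_bytes), ...] into (port_a, port_b) lists."""
    port_a, port_b = [], []
    pending_a = pending_b = 0          # ticks elapsed since last event per port
    for delta, event in events:
        pending_a += delta
        pending_b += delta
        if is_second_type(event):
            port_b.append((pending_b, event))
            pending_b = 0
        else:
            port_a.append((pending_a, event))
            pending_a = 0
    return port_a, port_b


def is_cue_point(event: bytes) -> bool:
    """Identify second-type events by the bit pattern FF 07."""
    return event[:2] == b"\xFF\x07"


note_on = b"\x90\x3C\x40"
cue = b"\xFF\x07\x04\x00\x00\x00\xB7"
note_off = b"\x80\x3C\x00"
port_a, port_b = split_events([(0, note_on), (100, cue), (20, note_off)],
                              is_cue_point)
print(port_a)
print(port_b)
```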

Output on port A of the parser 201 is sent to a synthesizer 202, wherein the
received MIDI signal is interpreted to make an analogue or digital
reproduction of the musical composition described by the MIDI signal.

Output on port B of the parser 201 is sent to an interpreter 203, wherein
additional content is interpreted together with a delta-time value preceding
the event that was conveying the additional content. This interpretation
comprises a determination of the address at which to retrieve the
compressed format file that it is intended to play at the time instance set by
the delta-time value. Optionally, the interpreter can identify the type of coding
scheme used to encode the compressed format file by reading information
indicative thereof, if present. Based on the determined address, the thereby
referenced portion of the compressed format file is retrieved from the sample
bank 106 via the interface 204. The retrieved portion is sent to a decoder
205, wherein the coded samples are decoded to provide a signal that can be
mixed with the analogue or digital reproduction provided from the synthesizer
202. The signals are mixed by means of adder 208 providing a mixed signal
for playback by means of an amplifier 207 and a loudspeaker 209. In order to
achieve synchronisation between the two signals provided to the adder 208,
a synchronization block 210 is provided. This synchronization block can be
implemented by controlling the operation of the synthesizer 202 relative to the
decoder 205 or vice versa. However, the synchronization can be
implemented in other ways.

It should be noted that the term 'referenced portion of the compressed format
file' can also be denoted a Compressed Audio Block, CAB; Compressed
Video Block, CVB; or Compressed Multimedia Block, CMB.

Fig. 3 illustrates a file container. The file container 301 comprises a MIDI file
302 and a coded audio file 303. Optionally, or alternatively, the file container
can comprise a coded video file 304. The coded audio file 303 and/or the
coded video file 304 is referred to, in the above, as the sample bank 106. By
means of the file container 301, a complete musical composition with an
instrumental portion and a vocal song or vocals portion can be distributed as
a single file. The coded audio file 303 can comprise multiple Compressed
Audio Blocks. It should be clear that the components in the container may be
interleaved to facilitate a suitable format for streaming.

Fig. 4a illustrates the structure of an event-based multimedia signal
combined with data in compressed audio blocks, compressed video blocks or
compressed multimedia blocks and event-based references to the blocks.
The event-based multimedia signal 401 comprises events of the
above-mentioned first type 407 (event-1) and the second type 406 (event-2). The
structure 401 illustrates the structure of a signal provided by the adder 105
and a signal received by the parser 201. In the second, alternative,
embodiment of the parser 201 the structure also represents the signal as
provided on port A of the parser.

The coded audio data 402 comprises blocks 403 and 404 of coded audio.
These blocks are addressed by event-based references 410 which are
embedded in the content of an event 406 of the second type. The delta-time
stamp DT preceding an event determines the point in time at which to start
playback of a respective block of coded audio.

Fig. 4b illustrates the structure of an event-based multimedia signal. The
structure 408 illustrates a MIDI signal wherein events 407 of the first type
only are present. Hence, there are no references to coded audio or video.

Fig. 4c illustrates the structure of coded audio data and event-based
references to the coded audio data. The structure 409 comprises events 406
of the second type each with a reference to coded audio or video.

Fig. 5 shows a flowchart of a method of composing a multimedia signal. The
method starts in step 501 and proceeds to step 502 wherein a counter
counting units of time is started; the counter is denoted a delta-time counter.
Subsequently, in step 503 it is examined whether a received event is either a
MIDI signal or an audio/video signal. If no events are detected, the method
will continue examining whether an event is received until an event is
received. In the latter-mentioned case, the method will proceed to step 504,
wherein it is examined whether the detected event is either an event that
represents arrival of a MIDI event or an event (CAB) representing start or
stop of the transmission of a coded audio block.

In case a MIDI event arrives, a delta time for the MIDI event is inserted.
Subsequently, the MIDI event is inserted in step 505 into the multimedia
signal which is being composed.



In case a block of audio/video starts being received or terminates being
received, it is determined whether the block starts or stops. In case the block
starts being received, a delta-time stamp is generated in step 507 based on
the count of the delta-time counter. In step 508, a meta-event is generated.
Since the complete address of the block of audio/video may not be known, a
pointer is set to the generated meta-event. Subsequently, streaming of the
coded audio block to file storage is started. The file storage may be an audio
file in a file container.

In case a block of audio/video terminates being received, the meta-event
referenced by a pointer set in step 508 is updated with any remaining
address information to provide complete information for accessing the stored
data. Subsequently, the process of streaming the coded audio block to the
file is terminated in step 511.

When the steps 508 or 509 or 511 have been completed, the method
resumes at step 503 to examine whether any events are being received.
However, as an option it can be examined in step 512 whether to stop the
method. Stopping the method during the process of streaming data to the
coded audio file should, however, be avoided.
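
The composing method can be sketched as a small class in which a reference event is created when a block starts and completed when the block stops. All names are hypothetical, the reference event is modelled as a dictionary rather than as encoded bytes, and addresses are taken to be inclusive byte offsets into the stored audio; these are assumptions made for illustration.

```python
class Composer:
    """Illustrative sketch of the composing flow (steps 503 to 511)."""

    def __init__(self):
        self.events = []          # [delta, payload] entries of the signal
        self.last_tick = 0        # state of the delta-time counter
        self.open_ref = None      # pointer to the event awaiting completion
        self.bank = bytearray()   # stands in for the coded audio file

    def _delta(self, now):
        delta, self.last_tick = now - self.last_tick, now
        return delta

    def midi_event(self, now, payload):
        """Insert a delta-time and the MIDI event into the signal."""
        self.events.append([self._delta(now), payload])

    def block_start(self, now):
        """Generate the reference event; the stop address is not yet known."""
        ref = {"start": len(self.bank), "stop": None}
        self.events.append([self._delta(now), ref])
        self.open_ref = ref

    def block_data(self, coded: bytes):
        """Stream coded audio to storage."""
        self.bank += coded

    def block_stop(self):
        """Complete the address information via the stored pointer."""
        self.open_ref["stop"] = len(self.bank) - 1
        self.open_ref = None


composer = Composer()
composer.midi_event(0, "note_on")
composer.block_start(100)
composer.block_data(bytes(184))      # 184 bytes: offsets 0x0000 to 0x00B7
composer.midi_event(120, "note_off")
composer.block_stop()
print(composer.events)
```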

Fig. 6 shows a flowchart of a method of decomposing a multimedia signal.
The method starts in step 601, wherefrom the method proceeds to step 602
to parse a received MIDI file or signal. In subsequent step 603, events of the
MIDI file or signal are selected one by one and their type is determined. The
events can be MIDI events conveying instrumental musical performance or
META events conveying information as set out in the MIDI specification
and/or information for locating coded audio data. In step 604, MIDI events are
passed on to step 605 and META events are passed on to step 606.

In step 605, MIDI events are executed in a synthesizer to provide a
reproduction of the instrumental portion of a composition or, alternatively,
transmitted to a synthesizer.

In step 606, events determined to be of the META event type with any
additional content are interpreted to deduce e.g. an address and/or a filename
at which coded audio data are located. In step 607, loading of coded audio
samples is started and continues while in a range specified by the address.
After step 607 a route 'a)' indicates a first embodiment while route 'b)'
indicates a second embodiment. According to the a) route, decoding of
coded audio samples is started in step 608. In order to ensure
synchronisation between sound produced by the synthesizer in step 605 and the
coded audio, synchronisation is started in step 609 before, and maintained
during, playback of the decoded samples in step 610. According to the b)
route, the addressed, coded audio samples are sent to a decoder in step 611
for subsequent playback.

Fig. 7a illustrates schematic envelopes of a multimedia signal. The envelopes
are depicted as a function of time t. The envelope 701 represents a musical
composition with a duration of typically 2.5 to 10 minutes. The musical
composition comprises, for illustrative purposes, four portions A1, B, C, and
A2 of vocal song or vocals.

In a first embodiment, the vocal song or vocals portions can be encoded in a
single and continuous block of data as illustrated by the arrow 706.

In a second embodiment, the vocal song or vocals portions can be encoded
in several blocks of data, as illustrated by arrows 707. The blocks can be
arranged temporally to cover only the parts where vocal song or vocals is
appreciated. Each block is represented in MIDI by means of a delta-time
stamp and a meta-event with additional content for addressing the block in
storage memory.

In a third embodiment, the blocks can be arranged temporally to cover vocal
song or vocals fractions corresponding to fractions of the lyric that are sung.
Thereby, coded samples of a fraction of a vocal song or vocals are contained
in a block. If fractions of a vocal song or vocals are repeated, for instance
three times, these three fractions can be reproduced by playback of the same
fraction. Additionally, since the duration of pauses between spoken words
accounts for up to about 66% of a speech or song, playback using multiple
reproductions of even single words can be efficient. Thereby, further
achievements in compression of a multimedia signal are obtained.

Fig. 7b illustrates temporal aspects of MIDI events, coded audio events and
samples of a playback signal. It is illustrated that samples are reproduced
periodically at even points in time, e.g. at a sample rate of 44.1 or 48 kHz. At
a less frequent rate, MIDI events 711, i.e. events of the first type, occur in a
MIDI file with information on which patches to use for playback, which
notes to play, and at which sound levels to play each of the notes. The
playback of the individual notes is defined in the events and can result in
simultaneous playback of different notes, overlapping playback etc. This
depends on the information in the events and can include attack, decay,
sustain, and fade durations.

For events with information according to the invention, and at a rate 
determined by the size of the coded audio blocks as discussed above, META 
events 710, i.e. events of the second type, occur. These events determine 
the playback of the vocal song or vocals performance and may result in 
simultaneous or overlapping playback of coded audio blocks, or in 
consecutive playback as illustrated in fig. 7a. 

Generally, track chunks are where actual vocal song or vocals data is stored. 
Each chunk is simply a stream of MIDI events preceded by delta-time values. 
The syntax is the following: 

<track chunk> = <length> <M event> + 

Wherein the plus sign "+" indicates that several of the fields <M event> 
typically will occur. 

The syntax of an M event is very simple: 

<M event> = <delta-time> <event> 

Here, <delta-time> is stored as a variable length quantity. It represents the 
amount of time before the following event. If the first event in a track occurs 
at the very beginning of a track, or if two events occur simultaneously, a 
delta-time of zero is used. Delta-times are always present in a standard MIDI 
file. Delta-time is in ticks as specified by the header chunk. 
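
As an illustration of the variable length quantity used for <delta-time>, the following minimal sketch encodes and decodes such a value in the way the standard MIDI file format defines it: seven data bits per byte, most significant group first, with the top bit set on every byte except the last. The function names are chosen for illustration only.

```python
def encode_vlq(value: int) -> bytes:
    """Encode a non-negative integer as a MIDI variable length quantity."""
    if value < 0:
        raise ValueError("delta-time must be non-negative")
    groups = [value & 0x7F]                    # last byte: top bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)   # earlier bytes: top bit set
        value >>= 7
    return bytes(reversed(groups))


def decode_vlq(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode the quantity starting at data[pos]; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:                    # top bit clear marks the last byte
            return value, pos
```

A delta-time of zero is thus stored as the single byte 00, and a delta-time of 128 ticks as the two bytes 81 00.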

<event> = <MIDI event> | <sysex event> | <meta-event> 

Here, it is indicated that the field <event> can be any one of the types <MIDI 
event> or <sysex event> or <meta-event>. 
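
How a parser might walk the <M event> fields of a track chunk and separate the three event types can be sketched as follows. This is a simplified illustration rather than the parser of the embodiments: it assumes the body of one track chunk (the bytes after <length>) is already in memory, and the function names are hypothetical. The status byte values (FF for meta-events, F0 and F7 for sysex events) follow the standard MIDI file format.

```python
def read_vlq(data: bytes, pos: int) -> tuple[int, int]:
    """Read a variable length quantity; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def iter_events(track: bytes):
    """Yield (absolute_tick, kind, type_or_status, payload) per <M event>."""
    pos, tick, running = 0, 0, None
    while pos < len(track):
        delta, pos = read_vlq(track, pos)
        tick += delta
        status = track[pos]
        if status == 0xFF:                       # meta-event
            meta_type = track[pos + 1]
            length, pos = read_vlq(track, pos + 2)
            yield tick, "meta", meta_type, track[pos:pos + length]
            pos += length
            running = None                       # meta cancels running status
        elif status in (0xF0, 0xF7):             # sysex event
            length, pos = read_vlq(track, pos + 1)
            yield tick, "sysex", status, track[pos:pos + length]
            pos += length
            running = None                       # sysex cancels running status
        else:                                    # MIDI channel message
            if status & 0x80:
                running = status
                pos += 1
            elif running is None:
                raise ValueError("data byte without running status")
            # Program change and channel pressure carry one data byte.
            n = 1 if (running & 0xF0) in (0xC0, 0xD0) else 2
            yield tick, "midi", running, track[pos:pos + n]
            pos += n


# Two note-on messages (the second using running status), a cue point
# meta-event with a two-byte text, and the end-of-track meta-event.
track = (b"\x00\x90\x3c\x40" b"\x10\x3e\x40"
         b"\x60\xff\x07\x02AB" b"\x00\xff\x2f\x00")
events = list(iter_events(track))
```

Events of the second type can then be selected by keeping only the entries whose kind is "meta" or "sysex".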

The field <MIDI event> contains any MIDI channel message. 

The field <sysex event> is used to specify a MIDI system exclusive message, 
either as one unit or in packets, or as an 'escape' to specify any arbitrary 
bytes to be transmitted. According to the invention, sysex events can convey 
information in the form of direct or indirect addresses or instructions to control 
playback of coded audio or video. It should be noted that the so-called 
multi-packet aspect of the sysex event is applicable within the scope of the 
invention. 

The field <meta-event> comprises meta-events of the type 'Cue points' with 
the syntax FF 07 <length> <text>, wherein the field <text> can convey the 
additional information according to the invention. Specific types of cue points 
can refer to individual event occurrences; each cue number may be assigned 
to a specific reaction, such as a specific one-shot sound event. The specific 
one-shot event can be to decode a specific CAB, CVB, or CMB. In this case, 
the specific block can be associated with a specified event number. 

Additionally, the field <meta-event> comprises meta-events of the type 'Lyric' 
with the syntax FF 05 <length> <text> and 'Text event' with the syntax FF 01 
<length> <text>, wherein the fields <text> can convey the additional 
information according to the invention. 
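
The byte layout of such a meta-event, preceded by its delta-time, can be sketched as below. The layout FF <type> <length> <text> and the type values 07, 05 and 01 are those stated above; the content of the <text> field is an assumption made purely for illustration, since the description does not fix a textual format for the address, and the function names are likewise hypothetical.

```python
CUE_POINT, LYRIC, TEXT_EVENT = 0x07, 0x05, 0x01


def encode_vlq(value: int) -> bytes:
    """Encode a non-negative integer as a variable length quantity."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(groups))


def build_meta_event(delta_ticks: int, meta_type: int, text: str) -> bytes:
    """Return <delta-time> FF <type> <length> <text> as raw bytes."""
    payload = text.encode("ascii")
    return (encode_vlq(delta_ticks)
            + bytes([0xFF, meta_type])
            + encode_vlq(len(payload))       # <length> is itself a VLQ
            + payload)


# Hypothetical address text: block kind, file name and byte offset.
address = "CAB;file=vocals.amr;offset=1024"
event = build_meta_event(480, CUE_POINT, address)
```

Because the additional content travels in the <text> field of a regular meta-event, a player that does not recognise it can skip the event by its <length> field and still play the events of the first type.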

Generally, it should be noted that the invention is not limited to the Musical 
Instrument Digital Interface (MIDI). Advantages of the present invention can 
be obtained for all types of files or streams of data where events carry at 
least a partial representation of content in a composition, e.g. in the form of a 
multimedia signal, especially an audio signal. Here, events are associated 
with information on the temporal instance at which to reproduce a specified 
vocals and/or musical and/or video and/or other multimedia performance. 
However, the invention is especially advantageous with any 
protocol that operates relative to a type of time line and a type of meta 
events. In fact, a 3GP container used in 3GPP can have text files attached 
along the time line, where a text file carries information for reproducing a 
multimedia performance and/or addresses/pointers to such information. 

Additionally, it should be noted that the invention is explained in 
connection with MIDI and musical and/or vocal performance. The term 
'multimedia' and/or 'multimedia signal' and/or 'multimedia performance' 
comprises 'audio' and/or 'audio signal' and/or 'audio performance', 
respectively, where audio comprises music and/or vocals. 

Finally, it should be noted that the meta control of audio, song or vocals 
according to the present invention allows pointing to any file, at any location 
within the file. Thereby, an efficient and flexible representation of a musical 
composition in combination with song, speech, vocals, or other audio content 
is provided. 

CLAIMS 

1. A method of composing a multimedia signal (401;409) according to a 
protocol using an event controlled representation of content in the 
multimedia signal where the signal is composed to carry: 

events (407) of a first type which are arranged to carry content in the form of 
instructions to a unit; and 

events (406) of a second type which are arranged to carry additional content 
(410); 

wherein the method comprises the following steps: 

generating the signal (401;409) by inserting (508) events (406) of the second 
type and by applying (510) additional content (410) to events (406) of the 
second type, wherein the additional content (410) comprises addresses of 
encoded samples of multimedia content (402) or encoded samples of 
multimedia content (402). 

2. A method according to claim 1, wherein the method comprises the step 
(505) of inserting events (407) of the first type. 

3. A method according to claim 1 or 2, wherein the method comprises the 
step (507) of inserting delta-time values before the events (406) of the 
second type, wherein the delta-time value represents a point in time at which 
to begin playback of the sampled multimedia content. 

4. A method of rendering a multimedia signal according to a protocol using 
an event controlled representation of content in the multimedia signal where 
the signal (401;409) is composed to carry: 

events (407) of a first type which are arranged to carry content in the form of 
instructions to a unit; and 

events (406) of a second type which are arranged to carry additional content 
(410); 

wherein the method comprises the following steps: 

parsing (602) the signal (401;409) to identify events (406) of the second type 
and to read the additional content (410); 
loading (607) encoded samples of multimedia content (402); and 
decoding (611) the encoded samples to provide decoded samples for 
playback of the multimedia content. 

5. A method according to claim 4, wherein the additional content specifies an 
address wherefrom the encoded samples are loaded. 

6. A method according to claim 4, wherein the additional content comprises 
the encoded samples. 

7. A method according to any of claims 4 to 6, wherein the unit renders an 
output signal in response to the events of the first type, and wherein the 
decoded samples are superimposed on the first signal in accordance with 
delta-time values of the events. 

8. A method according to any of claims 1 to 7, wherein the events (406) of 
the second type comprise System Exclusive events as defined in the 
specification of the Musical Instrument Digital Interface (MIDI). 

9. A method according to any of claims 1 to 8, wherein the events (406) of 
the second type comprise Meta-events as defined in the specification of the 
Musical Instrument Digital Interface (MIDI). 

10. A method according to claim 8, wherein the events (406) of the second 
type comprise Meta-events of the type cue-points, identified by the 
hexadecimal value FF 07. 

11. A method according to claim 8, wherein the events (406) of the second 
type comprise Meta-events of the type lyric, identified by the hexadecimal 
value FF 05. 

12. A method according to claim 8, wherein the events (406) of the second 
type comprise Meta-events of the type text, identified by the hexadecimal 
value FF 01. 

13. A method according to any of claims 1 to 12, wherein an address 
indicates a position in a first file (402; 303) associated with the multimedia 
signal. 

14. A method according to any of claims 1 to 13, wherein the multimedia 
signal is stored in a second file (302). 

15. A method according to any of claims 1 to 14, wherein the additional 
content comprises an indication of the type of the coding scheme used for 
encoding the encoded samples. 

16. A method according to any of claims 1 to 15, wherein the protocol 
complies with the general Musical Instrument Digital Interface (MIDI) 
specification. 

17. A unit for composing a multimedia signal according to a protocol using an 
event controlled representation of content in the multimedia signal, where the 
signal (401;409) is composed to carry: 

events (407) of a first type which are arranged to carry content in the form of 
instructions to a unit; and 

events (406) of a second type which are arranged to carry additional 
content; 

wherein the unit comprises: 

an event-inserter (103) arranged to insert events (406) of the second type 
and to apply additional content (410) to events (406) of the second type, 
wherein the additional content (410) comprises an address of encoded 
samples of multimedia content (402) or samples of multimedia content (402). 

18. A unit for rendering a multimedia signal according to a protocol using an 
event controlled representation of content in the multimedia signal, where the 
signal (401;409) is composed to carry: 

events (407) of a first type which are arranged to carry content in the form of 
instructions to a unit; and 

events (406) of a second type which are arranged to carry additional content; 
wherein the unit comprises: 

a parser (201) arranged to identify events (406) of the second type and to 
read the additional content (410); 
an interface (204) arranged to load samples of multimedia content and to 
send (205) encoded samples to a decoder to retrieve decoded samples for 
subsequent playback of the multimedia content. 

19. A unit according to claim 17 or 18, wherein the protocol complies with the 
general Musical Instrument Digital Interface (MIDI) specification. 

20. A multimedia signal according to a protocol using an event controlled 
representation of content in the multimedia signal, where the signal 
comprises: 

events (407) of a first type which are arranged to carry content in the form of 
instructions to a unit; and 

events (406) of a second type which are arranged to carry additional content 
(410); 

wherein the additional content (410) comprises an address of encoded 
samples of multimedia content (402) or encoded samples of multimedia 
content. 

21. A multimedia signal according to any of claims 18 to 20, wherein the 
events (406) of the second type comprise System Exclusive events as 
defined in the specification of the Musical Instrument Digital Interface (MIDI). 

22. A multimedia signal according to any of claims 18 to 21, wherein the 
events (406) of the second type comprise Meta-events as defined in the 
specification of the Musical Instrument Digital Interface (MIDI). 

23. A multimedia signal according to claim 22, wherein the events (406) of 
the second type comprise Meta-events of the type cue-points, identified by 
the hexadecimal value FF 07. 

24. A multimedia signal according to claim 22, wherein the events (406) of 
the second type comprise Meta-events of the type lyric, identified by the 
hexadecimal value FF 05. 

25. A multimedia signal according to claim 22, wherein the events (406) of 
the second type comprise Meta-events of the type text, identified by the 
hexadecimal value FF 01. 

26. A multimedia signal according to any of claims 20 to 25, wherein the 
protocol complies with the general Musical Instrument Digital Interface (MIDI) 
specification. 

ABSTRACT 

Methods and units of composing or decomposing a multimedia signal 
according to e.g. the Musical Instrument Digital Interface (MIDI) protocol, 
where the signal is composed to carry events of a first type which are 
arranged to carry instructions to a unit of which of predefined patches to use 
for playback and which of predefined notes to play; and events of a second 
type which are identifiable separately from events of the first type and which 
are arranged to carry additional content. The method of decomposing a 
multimedia signal comprises the steps of parsing the signal to identify events 
of the second type and to read the additional content; loading coded samples 
of multimedia content at an address specified in the additional content; and 
decoding the coded samples to provide decoded samples for playback of the 
multimedia content. Thereby, it is possible to convey vocal song or vocals 
and other audio type signals in an efficient way by means of the widely used 
MIDI protocol. 

(fig. 2 should be published) 